Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for DISTINCT + ORDER BY in ARRAY_AGG #14413

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

gabotechs
Copy link
Contributor

@gabotechs gabotechs commented Feb 2, 2025

Which issue does this PR close?

Rationale for this change

Completing ARRAY_AGG functionality as a prerequisite for adding the full functionality of STRING_AGG in #14412

What changes are included in this PR?

Adds a Postgres-style support for DISTINCT + ORDER_BY functionality, allowing users to issue statements like:

SELECT ARRAY_AGG(DISTINCT col ORDER BY col) FROM table;

Note that there's a limitation that prohibits ordering by an expression that is not the same as the ARRAY_AGG argument. For example, the following queries are invalid:

SELECT ARRAY_AGG(DISTINCT col ORDER BY other_col) FROM table; ❌
SELECT ARRAY_AGG(DISTINCT col ORDER BY concat(col, '')) FROM table; ❌

This is the same limitation that exists on Postgres, example in Postgres fiddle

Are these changes tested?

yes, both in unit tests and sqllogictests

Are there any user-facing changes?

Users will now be able to issue ARRAY_AGG calls mixing DISTINCT and ORDER_BY clauses

@github-actions github-actions bot added logical-expr Logical plan and expressions sqllogictest SQL Logic Tests (.slt) functions labels Feb 2, 2025
@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from 9400d72 to 8eaacd6 Compare February 2, 2025 17:34
@github-actions github-actions bot removed the logical-expr Logical plan and expressions label Feb 2, 2025
@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from 8eaacd6 to f23681c Compare February 2, 2025 17:37
@gabotechs gabotechs changed the title Add support for DISTINCT and ORDER BY in ARRAY_AGG Add support for DISTINCT + ORDER BY in ARRAY_AGG Feb 2, 2025
@@ -193,3 +193,149 @@ pub fn merge_ordered_arrays(

Ok((merged_values, merged_orderings))
}

#[cfg(test)]
mod tests {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved this tests from

to this file, as they are just testing the merge_ordered_arrays function present in this file, and nothing related to ARRAY_AGG.

As I added there some unit tests that do test ARRAY_AGG, I though that it might be a good idea to move these ones out to a more suitable place.

Comment on lines -342 to +376
if values.len() != 1 {
return internal_err!("expects single batch");
if values.is_empty() {
return Ok(());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As now the distinct accumulator can accept more than 1 batch because of the ordering, removing this restriction was necessary.

@alamb
Copy link
Contributor

alamb commented Feb 4, 2025

Close/reopen to rerun CI

@alamb alamb closed this Feb 4, 2025
@alamb alamb reopened this Feb 4, 2025
@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from f23681c to 0d53376 Compare February 15, 2025 09:24
// ARRAY_AGG(DISTINCT col ORDER BY other_col) <- Invalid
// ARRAY_AGG(DISTINCT col ORDER BY concat(col, '')) <- Invalid
if acc_args.ordering_req.len() > 1 {
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a copy-paste typo of the error message? Shouldn't it rather be something more like:

return exec_err!("In an aggregate with DISTINCT, only one ORDER BY expression is allowed");

?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The restriction goes further than that: all the expressions used in the ORDER BY must appear in as arguments of the aggregation function. In this case, ARRAY_AGG accepts one argument, therefore only that 1 expression is also valid as an ORDER BY expression.

I took the error message from Postgres, but maybe it's better to customize it to the ARRAY_AGG function.

let mut sort_option: Option<SortOptions> = None;
if let Some(order) = acc_args.ordering_req.first() {
if !order.expr.eq(&acc_args.exprs[0]) {
return exec_err!("In an aggregate with DISTINCT, ORDER BY expressions must appear in argument list");

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there can be only one ORDER BY expression, the error message should refer to it in singular. I'd suggest:

return exec_err!("In an aggregate with DISTINCT, the ORDER BY expression must match the aggregate argument");

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since there can be only one ORDER BY expression, the error message should refer to it in singular

This was pretty much the error message that Postgres would return, but they have this limitation as a general rule for all aggregation functions, which does not seem to be the case for DataFusion. The other option would be to fully customize the error message for the specific ARRAY_AGG case, which will result in a less standard but more informative error message. Any opinions here @alamb ?

@@ -131,7 +133,32 @@ impl AggregateUDFImpl for ArrayAgg {
let data_type = acc_args.exprs[0].data_type(acc_args.schema)?;

if acc_args.is_distinct {
return Ok(Box::new(DistinctArrayAggAccumulator::try_new(&data_type)?));
// Limitation similar to Postgres. The aggregation function can only mix

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doc comment is great! Would it be possible to make it more user-facing? Maybe as a rustdoc over the ArrayAgg structure? As well as in the ArrayAgg documentation if it exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to look for inspiration in other aggregation function docs, and they seem overly brief on purpose, no other doc mentions anything about DISTINCT clause behavior or anything like that. Maybe someone else can chime in and throw some light about what's the best approach here.

Copy link
Contributor

@NGA-TRAN NGA-TRAN left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice @gabotechs
I suggests to add a few more corner and negative tests

@gabotechs gabotechs force-pushed the array-agg-distinct-order-by branch from 0d53376 to dbc8a75 Compare February 20, 2025 15:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
functions sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

array_agg cannot perform both distinct and order_by
4 participants